Data Cleaning: Removing NaN Values and Duplicates

Initial Data Exploration

Removing redundant columns and NaN Values

To find the rows with NaN (not a number) values, we can create a subset of the DataFrame containing only the rows where .isna() evaluates to True. To remove them, we can call .dropna(). We see that NaN values in ratings are associated with no reviews (and no installs), which makes sense.
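As a minimal sketch of this step, the snippet below builds a tiny stand-in DataFrame and then inspects and drops the rows with missing values. The DataFrame name df_apps, the sample rows, and the exact column names are assumptions made for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative assumptions.
df_apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta"],
    "Rating": [4.5, np.nan, 3.9, np.nan],
    "Reviews": [120, 0, 45, 0],
    "Installs": ["1,000", "0", "500", "0"],
})

# Subset of rows where the rating is missing
nan_rows = df_apps[df_apps["Rating"].isna()]
print(nan_rows)

# Drop every row that contains at least one NaN value
df_apps_clean = df_apps.dropna()
print(df_apps_clean.shape)
```

Looking at nan_rows before dropping anything is a cheap way to check that the missing ratings really do belong to apps with no reviews.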

Look for duplicate rows

To identify duplicates, we need to provide the column names that should be used in the comparison. In pandas, these go in the subset parameter of .drop_duplicates().
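The following sketch shows one way to do this with pandas, assuming a small made-up DataFrame. The columns chosen for the comparison (App, Type, Price) are an assumption for illustration; which columns define a "duplicate" depends on the data.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative assumptions.
df_apps_clean = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Alpha", "Beta"],
    "Type": ["Free", "Free", "Free", "Paid"],
    "Price": [0.0, 0.0, 0.0, 1.99],
    "Reviews": [100, 100, 105, 20],
})

# Rows that repeat an earlier row in every column
fully_duplicated = df_apps_clean[df_apps_clean.duplicated()]
print(fully_duplicated)

# Treat rows as duplicates when only the listed columns match,
# keeping the first occurrence of each
df_no_dupes = df_apps_clean.drop_duplicates(subset=["App", "Type", "Price"])
print(df_no_dupes)
```

Without subset, the third "Alpha" row would survive because its review count differs slightly from the first two.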

Preliminary Data Exploration

Create Pie and Donut Charts

We will use plotly, a commonly used data visualisation library that you can use in combination with or instead of Matplotlib.

If you’d like to configure other aspects of the chart that you can’t see in the list of parameters, you can call the .update_traces() method. In plotly lingo, “traces” refers to the graphical marks on a figure. Think of “traces” as collections of attributes. Here we update the traces to change how the text is displayed.

To create a donut 🍩 chart, pass a value for the hole argument when creating the pie chart.

Modifying the datatype of particular columns

We can remove the comma (,) character - or any character for that matter - from a DataFrame column using the pandas string method .str.replace(). Here we’re saying: “replace the , with an empty string”. This completely removes all the commas in the Installs column. We can then convert our data to a number using pd.to_numeric(), which is a pandas function rather than a DataFrame method.
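The conversion can be sketched as follows, assuming the Installs column holds strings such as "1,000". The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the values are illustrative assumptions.
df = pd.DataFrame({"Installs": ["1,000", "10,000,000", "500"]})

# Replace every comma with an empty string (plain text match, not a regex)
df["Installs"] = df["Installs"].astype(str).str.replace(",", "", regex=False)

# Convert the cleaned strings to numbers
df["Installs"] = pd.to_numeric(df["Installs"])
print(df["Installs"].dtype)
```

Passing regex=False makes it explicit that the comma is matched literally, which matters for characters like $ or + that have a special meaning in regular expressions.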

Find number of unique items in a column
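One way to count the distinct values in a column is the pandas .nunique() method. The sketch below assumes a made-up Category column.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY", "TOOLS"],
})

# Number of distinct categories in the column
n_categories = df["Category"].nunique()
print(n_categories)
```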

Number of apps per category
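A short sketch of counting apps per category with .value_counts(), again on made-up data. The result is a Series sorted from the most to the least frequent category.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta"],
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY", "TOOLS", "GAME"],
})

# Count how many rows (apps) fall into each category
apps_per_category = df["Category"].value_counts()
print(apps_per_category)
```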

Bar Chart

How many installs has each category had
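To total the installs for each category, one option is to group by category and sum. The sketch below assumes Installs has already been converted to a numeric column; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY"],
    "Installs": [1000, 500, 2000, 100],
})

# Sum the installs within each category
category_installs = df.groupby("Category").agg({"Installs": "sum"})

# Sort so the most-installed category comes first
category_installs = category_installs.sort_values("Installs", ascending=False)
print(category_installs)
```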

Grouped Bar Charts & Box Plots